RTX ON - THE NVIDIA TURING GPU
INTRODUCING TURING


Greatest Leap Since 2006 CUDA GPU
10 Giga Rays/sec


Ray Triangle Intersection 


BVH Traversal 


INTRODUCING TURING
NVLINK CHANNELS 2


NVIDIA TURING GPU - NEW EFFICIENT SM 


Turing SM >1.5x Pascal SM Performance
TURING SM


Concurrent FP & INT Execution Datapaths 
Enhanced L1 cache 
Uniform Datapath & RF 


1 warp = 32 threads 


L2 $


TURING SM 
MICROARCHITECTURE 


Evolved for Efficiency 


Built on foundation of Volta SM 
(V100: HPC/Datacenter solution between Pascal and Turing 
Architectures: see HotChips2017 talk) 


Compared to Pascal, Turing provides: 
> Twice the schedulers 
> Simplified issue logic 


> Large, fast L1 cache unified 
with TEX $ and Shared Memory 


NEW CACHE & SHARED MEM ARCHITECTURE 


PASCAL 


[Diagram labels: SM global data, shared data, sub-core instructions, crossbar]
Evolved for Efficiency
TURING 


[Diagram labels: SM global/texture data, shared data, sub-core instructions, MIO, MIO Scheduler]


Compared to Pascal: 


2x L1 Bandwidth 
Lower L1 Hit Latency 
Up to 2.7x L1 Capacity 
2x L2 Capacity 


MIO Datapath 
64 B/clk 


MIO Scheduler 
1 warp instr/4 clk 


TURING SM 
MICROARCHITECTURE 


Evolved for Efficiency 


Compared to Pascal: 


> Twice the register file capacity
Improved SIMT model & branch unit
Concurrent FP and INT execution 
New Uniform registers and datapath 


New Tensor Core 


16x8x8 FP16 tensor/8 clk 
8x8x16 INT8 tensor/4 clk 
8x8x32 INT4 tensor/4 clk 


Fast FP16 math 


CONCURRENT
EXECUTION


[Chart: (INT + FP instructions) / FP instructions for Far Cry 5, GTA V, RoTR, SoW, The Division, Witcher, and GeoMean]


Per 100 FP instructions,
average 36 INT PIPE instructions
(i.e., iadd, select, fp min/max, compare, etc.)
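One way to read the 36-per-100 figure: if INT-pipe instructions shared issue slots with FP instructions, 100 FP instructions would cost 136 slots, whereas a concurrent INT datapath could ideally hide the 36. The sketch below is an idealized upper bound only. It assumes perfect overlap and no dependencies between the two streams, and the function names are illustrative rather than taken from the talk.

```python
def issue_slots_shared(fp_instrs: int, int_instrs: int) -> int:
    """Issue slots needed when FP and INT instructions share one datapath."""
    return fp_instrs + int_instrs


def issue_slots_concurrent(fp_instrs: int, int_instrs: int) -> int:
    """Idealized issue slots with separate FP and INT datapaths.

    Assumes perfect overlap, so the longer of the two streams sets the cost.
    """
    return max(fp_instrs, int_instrs)


def ideal_speedup(fp_instrs: int, int_instrs: int) -> float:
    """Upper bound on the gain from concurrent FP and INT execution."""
    shared = issue_slots_shared(fp_instrs, int_instrs)
    concurrent = issue_slots_concurrent(fp_instrs, int_instrs)
    return shared / concurrent


if __name__ == "__main__":
    # Slide figure: on average 36 INT-pipe instructions per 100 FP instructions.
    print(f"ideal speedup: {ideal_speedup(100, 36):.2f}x")
```

Under these assumptions the bound is 136 / 100, or about 1.36x. Real shaders would land below that because of dependencies and other bottlenecks.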


Register File (16,384 x 32-bit)


TENSOR
RT CORE


UNIFORM DATAPATH & REGISTER FILE


Goal: Exploit redundant computation & data across multiple
threads while preserving our Independent Thread Scheduling model


> Automatically promote ops/data when warp-uniform data
is detected


> Compiler + hardware assist


> Executed by an independent datapath


> "Reverse vectorization"


[Diagram labels: TRADITIONAL SIMD/VECTOR, Scalar Thread, SIMD/Vector lanes, Control flow, Vector execution mask, TURING SIMT, Uniform op/data, SIMT threads]
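The idea of promoting warp-uniform work can be illustrated with a toy model: when every lane of a warp holds the same operands, the operation only needs to be computed once. This is a conceptual sketch, not how the hardware or compiler is implemented. The slide attributes promotion to the compiler with hardware assist, whereas this model checks uniformity at run time purely for illustration, and `is_warp_uniform` and `warp_add` are hypothetical names.

```python
WARP_SIZE = 32  # from the slides: 1 warp = 32 threads


def is_warp_uniform(values):
    """True when every lane of the warp holds the same value."""
    return all(v == values[0] for v in values)


def warp_add(a, b):
    """Add two per-lane operand lists and count the adds actually executed.

    If both operands are warp-uniform, the add is done once (the "uniform"
    path) and the result is shared by all lanes. Otherwise every lane
    performs its own add. Returns (per-lane results, adds executed).
    """
    assert len(a) == len(b) == WARP_SIZE
    if is_warp_uniform(a) and is_warp_uniform(b):
        result = a[0] + b[0]
        return [result] * WARP_SIZE, 1
    return [x + y for x, y in zip(a, b)], WARP_SIZE


if __name__ == "__main__":
    base = [0x1000] * WARP_SIZE          # same base address in every lane
    offset = [16] * WARP_SIZE            # same constant offset in every lane
    lane_id = list(range(WARP_SIZE))     # differs per lane

    _, uniform_adds = warp_add(base, offset)
    _, vector_adds = warp_add(base, lane_id)
    print(uniform_adds, vector_adds)
```

In this model, address arithmetic on values shared by the whole warp costs one add instead of 32, which is the kind of redundancy the uniform datapath is described as exploiting.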


Example: Enabling DX12 bindless constants with URF/UDP on
Forza MS7 yielded +12.7% performance


[Diagram label: Diverged thread]


UIADD3 UR13, UR9, 0x300001, URZ

ULDC.64 UR20, [UR6 + 0x18], !UP7

UIADD3 UR6, UR8, UR10, URZ

UIADD3 UR8, UR9, 0x300002, URZ

FSETP.NEU.FTZ.AND P1, PT, R15, cx[UR20][0x64], PT

ULOP3.LUT UR12, UR13, 0xfffff, URZ, 0xc0, !UP7


TURING SHADING PERFORMANCE VS PASCAL 


>30% Improved Performance per Core


[Chart: Relative Shader Performance, 1.0X baseline; labels: Example shader, VRMark, Sniper Elite 4, Deus Ex, SoW, 3DMark, RoTR]


NVIDIA TURING GPU - NEW TENSOR CORE 


Turing Tensor Core for Real-time Inference
455 TOPS INT4


TENSOR CORE 


Breakthrough Acceleration for Computation of Matrix Multiplies 


114 TFLOPS FP16 
228 TOPS INT8 
455 TOPS INT4 
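These peak figures can be cross-checked against the per-instruction tensor rates listed earlier in the deck (16x8x8 FP16 per 8 clk, 8x8x16 INT8 per 4 clk, 8x8x32 INT4 per 4 clk). The sketch below is a back-of-the-envelope check, not an official derivation. It assumes that those rates are per sub-core, that an SM has 4 sub-cores, and that the GPU is an RTX 2080 Ti with 68 SMs at a 1635 MHz boost clock. None of those three values appear on the slides.

```python
SUB_CORES_PER_SM = 4       # assumption: not stated on these slides
SM_COUNT = 68              # assumption: RTX 2080 Ti
BOOST_CLOCK_HZ = 1.635e9   # assumption: RTX 2080 Ti Founders Edition boost

# (M, N, K, clocks per tensor instruction), as listed on the SM slide.
TENSOR_RATES = {
    "FP16": (16, 8, 8, 8),
    "INT8": (8, 8, 16, 4),
    "INT4": (8, 8, 32, 4),
}


def peak_tera_ops(m: int, n: int, k: int, clocks_per_instr: int) -> float:
    """Peak tera-operations per second, counting a multiply-add as 2 ops."""
    macs_per_clk = m * n * k / clocks_per_instr          # per sub-core
    ops_per_clk_per_sm = 2 * macs_per_clk * SUB_CORES_PER_SM
    return ops_per_clk_per_sm * SM_COUNT * BOOST_CLOCK_HZ / 1e12


if __name__ == "__main__":
    for name, shape in TENSOR_RATES.items():
        print(f"{name}: {peak_tera_ops(*shape):.1f}")
```

With these assumptions the results round to 114, 228, and 455, in line with the numbers on the slide.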


(RTX 2080 Ti)
Math Dispatch Unit
1 warp instr/clk 


MIO Datapath 
64 B/clk 


MIO Scheduler 
1 warp instr/4 clk 


TENSOR CORE 


Breakthrough Acceleration for 
Computation of Matrix Multiplies 


Multi-thread collaborative matrix math operation 


> Sharing operands across threads saves RF 
and shared memory BW 


Fine-grained integration inside SM 


> Provides maximum algorithmic flexibility 


> Different activation functions, 
Batch norm variants, etc. 


> Leverages huge storage capacity and BW 
provided by RF and shared mem/L1$ 


8b & 4b integer support with 32b accumulation 
for maximum inference performance 
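A small reference model shows why narrow integer inputs are paired with a 32-bit accumulator. It is written in plain Python as a sketch of the arithmetic only, not of the hardware. Reading the "8x8x16" INT8 shape as M=8, N=8, K=16 is an assumption, and `int8_matmul_int32_acc` is a hypothetical helper name.

```python
INT8_MIN, INT8_MAX = -128, 127
INT16_MAX = 2**15 - 1
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def int8_matmul_int32_acc(a, b):
    """Multiply an MxK int8 matrix by a KxN int8 matrix.

    Products are summed in an accumulator that is checked to stay within
    the signed 32-bit range.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    for row in a + b:
        assert all(INT8_MIN <= v <= INT8_MAX for v in row), "inputs must be int8"

    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            assert INT32_MIN <= acc <= INT32_MAX
            c[i][j] = acc
    return c


if __name__ == "__main__":
    # Worst case for an assumed 8x8x16 tile: every input is -128.
    a = [[INT8_MIN] * 16 for _ in range(8)]   # M x K
    b = [[INT8_MIN] * 8 for _ in range(16)]   # K x N
    c = int8_matmul_int32_acc(a, b)
    print(c[0][0], c[0][0] > INT16_MAX)
```

Each product can reach 16,384, and sixteen of them sum to 262,144, which no longer fits in 16 bits. A 32-bit accumulator leaves ample headroom for chaining many such tiles.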


DEEP LEARNING INFERENCE ON TESLA T4 


Up to 36X Faster Than CPUs | Accelerates All AI Workloads


[Charts: speedup v. CPU Server for CPU Server (1.0), Tesla P4, and Tesla T4; the Peak Performance panel compares P4 (float, int8) and T4 (float, int8, int4)]


Peak Performance: P4 vs. T4
Speech Inference: DeepSpeech 2, Speedup: 21X Faster
Video Inference: ResNet-50 (7ms latency limit), Speedup: 27X Faster
Language Inference: Natural Language Processing, GNMT Model, Speedup: 36X Faster
Bar labels that survive extraction: 260, 22, 27X, 10X


ENDLESS POSSIBILITIES OF DEEP LEARNING 


Deep Learning Disruption in Gaming and Professional Graphics 


MATERIAL & ART ENHANCEMENT 


VOICE COMMANDS 


AI SLOW MOTION VIDEO


STYLE TRANSFER & CONTENT CREATION:
GauGAN


FACIAL & CHARACTER ANIMATION


NVIDIA TURING GPU - NEW RT CORE


Turing RTX is 7x Pascal Ray Tracing Performance


First Ray Tracing GPU 


10 Giga Rays/sec 


Ray Triangle Intersection 


BVH Traversal 


RTX - RAY TRACING 
ACCELERATED 


Real-time Ray Tracing has Arrived
Attack from Outer Space UE4 Demo by Christian Hecht


PATH TRACED GLOBAL ILLUMINATION 


Simulate Physically Based Light Transport by Tracing 'Photons' with Rays


Commonly used for CGI in films 


> But many hours to produce 
final images on CPU 


Fundamental building blocks 
> Sampling 
> Traversal and Intersection 


> Material evaluation


1. Primary
2. Reflection / Refraction
3. Direct local light / shadow
4. Direct sun light / shadow
5. Indirect bounce
6. Indirect local light / shadow
7. Indirect sun light / shadow


PRE-RTX GPU RAY TRACING


Software Emulation for Ray/Geometry Intersection Search 
Pascal SM Shaders 


Launch Ray Probe
Fetch box
Decode box
Intersection test
Sub-box or tris?
Ray/triangle intersection test
Return hit
(Many thousands of instruction slots per ray)


Shading 
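The software loop on this slide (fetch a box, test it, descend into sub-boxes or test triangles, return the hit) can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not NVIDIA's implementation. The slides do not say which tests are used, so the slab test for boxes and the Möller-Trumbore test for triangles are standard techniques swapped in here, and the `Node` layout is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


@dataclass
class Node:
    """Hypothetical BVH node: a bounding box with sub-boxes and/or triangles."""
    lo: Vec3
    hi: Vec3
    children: List["Node"] = field(default_factory=list)
    triangles: List[Triangle] = field(default_factory=list)


def hits_box(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: does the ray (t >= 0) overlap the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        if direction[axis] == 0.0:
            if not lo[axis] <= origin[axis] <= hi[axis]:
                return False
            continue
        t0 = (lo[axis] - origin[axis]) / direction[axis]
        t1 = (hi[axis] - origin[axis]) / direction[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    return t_near <= t_far


def hit_triangle(origin: Vec3, direction: Vec3, tri: Triangle,
                 eps: float = 1e-9) -> Optional[float]:
    """Moller-Trumbore test: distance along the ray to the hit, or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None


def trace(root: Node, origin: Vec3, direction: Vec3) -> Optional[float]:
    """Walk the BVH and return the closest hit distance, or None on a miss."""
    closest = None
    stack = [root]
    while stack:
        node = stack.pop()                       # fetch and decode box
        if not hits_box(origin, direction, node.lo, node.hi):
            continue                             # box intersection test
        stack.extend(node.children)              # sub-boxes to visit next
        for tri in node.triangles:               # ray/triangle tests
            t = hit_triangle(origin, direction, tri)
            if t is not None and (closest is None or t < closest):
                closest = t
    return closest                               # return hit
```

Every iteration of the `while` loop runs on the shader cores in the pre-RTX scheme, which is where the instruction-slot cost noted on the slide comes from.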


TURING RAY TRACING WITH RT CORES 


Hardware Acceleration Replaces Software Emulation 


Turing SM Shaders
Shading
(E.g. material evaluation, importance sampling, denoising, custom intersection, etc.)


RT Core
Box Intersection Evaluators: Fetch box, Decode box, Intersection test, Sub-box or tris?
Triangle Intersection Evaluators


RT CORE 


PASCAL, GTX 1080Ti: 202 ms, 5 fps
TURING, RTX 2080, NO RT CORES: 97 ms, 10 fps
TURING RTX, RTX 2080, RT CORES ON: 29 ms, 34 fps


ONE QUAKE II RTX FRAME 


Breakthrough Acceleration Enables Real-time Path Tracing 


7X speedup 


[Legend: FP32 Cores, INT32 Cores, RT Cores, Other Graphics, Memory]
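The frame rates and the headline speedup follow directly from the frame times shown for this Quake II RTX frame (202 ms, 97 ms, 29 ms). The short sketch below just redoes that arithmetic. The dictionary keys and function names are illustrative.

```python
FRAME_TIMES_MS = {
    "Pascal GTX 1080 Ti": 202,
    "Turing RTX 2080, no RT cores": 97,
    "Turing RTX 2080, RT cores on": 29,
}


def fps(frame_time_ms: float) -> float:
    """Frames per second for a given frame time in milliseconds."""
    return 1000.0 / frame_time_ms


def speedup(baseline_ms: float, new_ms: float) -> float:
    """How many times faster the new frame time is than the baseline."""
    return baseline_ms / new_ms


if __name__ == "__main__":
    for name, ms in FRAME_TIMES_MS.items():
        print(f"{name}: {ms} ms, {fps(ms):.1f} fps")
    pascal = FRAME_TIMES_MS["Pascal GTX 1080 Ti"]
    no_rt = FRAME_TIMES_MS["Turing RTX 2080, no RT cores"]
    rt_on = FRAME_TIMES_MS["Turing RTX 2080, RT cores on"]
    print(f"Pascal to Turing with RT cores: {speedup(pascal, rt_on):.1f}x")
    print(f"RT cores alone on Turing: {speedup(no_rt, rt_on):.1f}x")
```

The overall ratio of 202 ms to 29 ms is just under 7, matching the "7X speedup" label. On the same Turing GPU, enabling the RT cores accounts for roughly a 3.3x reduction in frame time.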


REAL-TIME RAY TRACING IS HERE 


GAMES 


Most Anticipated Games | Biggest Franchises
[Game logos; legible titles: DRAGON HOUND, TOMB RAIDER, WATCH DOGS, Wolfenstein]


RTX 


ENGINES AND APIs 


Support in all Major Game Engines


[Logos: Vulkan, DirectX]


NVIDIA TURING GPU 
Greater Than the Sum of Its Parts
RT Core


10 Giga Rays/sec 


Ray Triangle Intersection 


BVH Traversal 


PROFESSIONAL RENDERING
ON QUADRO RTX


SM + RT Core + Tensor Core = Accelerated Ray Tracing and AI Denoising


NVIDIA TURING GPU 


Evolved for Efficiency and Breakthrough Acceleration 


TURING GPU: Next Gen Graphics Realized
SM CORE: >1.5x Faster SM
TENSOR CORE: Real-time Inference
RT CORE: >7x Faster Ray Tracing


More Turing features: GDDR6, Variable Rate Shading, Mesh Shading, Post-L2 Cache Data Compression,
NVLINK Connectivity, USB-C, and many more...


THANK YOU - QUESTIONS? 


